Sequential composition of schema mappings

ABSTRACT

A method for generating a schema mapping. A provided mapping M₁₂ relates schema S₁ to schema S₂. A provided mapping M₂₃ relates schema S₂ to schema S₃. A mapping M₁₃ is generated from schema S₁ to schema S₃ as a composition of mappings M₁₂ and M₂₃. Mappings M₁₂, M₂₃, and M₁₃ are each expressed in terms of at least one second-order nested tuple-generating dependency (SO nested tgd). Mapping M₁₃ does not expressly recite any element of schema S₂. At least one schema of the schemas S₁ and S₂ may comprise at least one complex type expression nested inside another complex type expression. Mapping M₁₃ may define the composition of the mappings M₁₂ and M₂₃ with respect to a relationship semantics or a transformation semantics.

BACKGROUND OF THE INVENTION

1. Technical Field

The present invention relates to a method and system for generating a schema mapping that composes two given schema mappings.

2. Related Art

Compositional mappings between schemas may be expressed as constraints in logic-based languages. However, there is no known compositional mapping that applies to Extensible Markup Language (XML) schemas. Accordingly, there is a need for a compositional mapping that applies to XML schemas.

SUMMARY OF THE INVENTION

The present invention provides a method for generating a schema mapping, said method comprising:

-   providing a mapping M₁₂ from a schema S₁ to a schema S₂, said mapping M₁₂ relating the schema S₂ to the schema S₁, said schema S₁ and schema S₂ each comprising one or more elements, said mapping M₁₂ being expressed in terms of at least one second-order nested tuple-generating dependency (SO nested tgd);
-   providing a mapping M₂₃ from the schema S₂ to a schema S₃, said mapping M₂₃ relating the schema S₃ to the schema S₂, said schema S₃ comprising one or more elements, said mapping M₂₃ being expressed in terms of at least one SO nested tgd; and
-   generating a mapping M₁₃ from the schema S₁ to the schema S₃, said mapping M₁₃ relating the schema S₃ to the schema S₁, said mapping M₁₃ being expressed in terms of at least one SO nested tgd that does not expressly recite any element of the schema S₂, said generating the mapping M₁₃ comprising generating the mapping M₁₃ as a composition of the mappings M₁₂ and M₂₃, wherein at least one schema of the schemas S₁ and S₂ comprises at least one complex type expression nested inside another complex type expression.

The present invention provides a method for generating a schema mapping, said method comprising:

-   providing a mapping M₁₂ from a schema S₁ to a schema S₂, said mapping M₁₂ relating the schema S₂ to the schema S₁, said schema S₁ and schema S₂ each comprising one or more elements, said mapping M₁₂ being expressed in terms of at least one second-order nested tuple-generating dependency (SO nested tgd);
-   providing a mapping M₂₃ from the schema S₂ to a schema S₃, said mapping M₂₃ relating the schema S₃ to the schema S₂, said schema S₃ comprising one or more elements, said mapping M₂₃ being expressed in terms of at least one SO nested tgd; and
-   generating a mapping M₁₃ from the schema S₁ to the schema S₃, said mapping M₁₃ relating the schema S₃ to the schema S₁, said mapping M₁₃ being expressed in terms of at least one SO nested tgd that does not expressly recite any element of the schema S₂, said generating the mapping M₁₃ comprising generating the mapping M₁₃ as a composition of the mappings M₁₂ and M₂₃, wherein the mapping M₁₃ defines the composition of the mappings M₁₂ and M₂₃ with respect to a transformation semantics.

The present invention advantageously provides a compositional mapping that applies to XML schemas.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1 and 2 depict diagrammatically how mapping composition can be applied for target evolution and source evolution, respectively, in accordance with embodiments of the present invention.

FIG. 3 illustrates two instances over a schema, in accordance with embodiments of the present invention.

FIG. 4 shows a pictorial view of how formulas in a schema mapping from schema S₁ to schema S₂ relate elements in schema S₁ and elements in schema S₂, in accordance with embodiments of the present invention.

FIG. 5 depicts schemas illustrating a Skolem function, in accordance with embodiments of the present invention.

FIG. 6 depicts a mapping that illustrates the chase process for constructing a target instance, in accordance with embodiments of the present invention.

FIG. 7 illustrates transforming an intermediate instance into a final instance in Partitioned Normal Form (PNF), in accordance with embodiments of the present invention.

FIG. 8 illustrates a sequence of schema mappings, in accordance with embodiments of the present invention.

FIGS. 9A-9C are flow charts depicting an algorithm for determining a composition of schema mappings with respect to relationship semantics, in accordance with embodiments of the present invention.

FIG. 10 illustrates de-nesting rewrite rules, in accordance with embodiments of the present invention.

FIG. 11 illustrates union separation, record projection, and case reduction rewrite rules, in accordance with embodiments of the present invention.

FIG. 12 is a flow chart depicting an algorithm for determining a composition of schema mappings with respect to transformation semantics, in accordance with embodiments of the present invention.

FIG. 13 illustrates a computer system for determining sequential composition of schema mappings, in accordance with embodiments of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

A schema is a set of rules describing how stored information is organized. The stored information may be in the form of data structures (e.g., databases, tables, files, etc.). A schema mapping from a first schema to a second schema may define a composition of two sequential mappings with respect to a relationship semantics or to a transformation semantics. A schema mapping pertaining to a relationship semantics describes relationships between the first schema and the second schema. A schema mapping pertaining to a transformation semantics describes a transformation of data from the first schema to the second schema.

The present invention addresses the problem of sequential composition of two schema mappings. Given a mapping M₁₂ describing a relationship between a schema S₁ and a schema S₂, and a mapping M₂₃ describing a relationship between schema S₂ and schema S₃, the present invention derives a mapping M₁₃ that describes the implied relationship between schema S₁ and S₃, without referring to schema S₂ in the mapping M₁₃. This problem is of importance in areas such as information integration and schema evolution, where tools are needed to manipulate and reason about mappings between relational database schemas and/or mappings between Extensible Markup Language (XML) schemas. Mappings between schemas may be expressed as constraints in a logic-based language; these constraints specify how data stored under one schema relate to data stored under another schema.

The present invention describes a system that implements two possible semantics, both useful in practice, for the result of composing two sequential mappings. These two semantics are a relationship semantics and a transformation semantics. The relationship semantics is more general than the transformation semantics and handles composition of mappings when mappings are used to describe arbitrary relationships between schemas. The transformation semantics applies only when mappings are used to describe transformations for exchanging data from a source schema to a target schema. The present invention provides algorithms for composing sequential schema mappings under both the relationship semantics and the transformation semantics. In effect, this disclosure claims two different systems for sequential composition of schema mappings: one supporting the relationship semantics, and one supporting the transformation semantics.

The remainder of the Detailed Description of the Invention is divided into the following sections:

-   1. APPLICATIONS
-   2. BASIC TERMS, CONCEPTS, AND NOTATION
-   3. ALGORITHM 1: COMPOSITION WITH RESPECT TO RELATIONSHIP SEMANTICS
-   4. ALGORITHM 2: COMPOSITION WITH RESPECT TO TRANSFORMATION SEMANTICS
-   5. COMPUTER SYSTEM

1. APPLICATIONS

Applications of the present invention comprise schema evolution, optimization of data transformation flows, and mapping deployment.

1.1 Schema Evolution

Schema mappings are abstractions or models for the more concrete scripts or queries that can be used to transform data from one schema to another. In other words, given a schema mapping M₁₂ from S₁ to S₂, a human or a software tool can generate a script or query or program to transform a database or a document instance that conforms to schema S₁ into a database or document instance that conforms to schema S₂. However, when schema evolution occurs, such as when schema S₁ changes to a new schema S₁′ or schema S₂ changes to a new schema S₂′, the earlier scripts or queries or programs from S₁ to S₂ are no longer applicable and new ones must be generated.

The present invention can be used in solving the above problem by employing either of the two mapping composition algorithms claimed. FIGS. 1 and 2 depict diagrammatically how mapping composition can be applied for target evolution and source evolution, respectively, in accordance with embodiments of the present invention. FIG. 1 shows the case of target schema evolution (the target schema S₂ changes to S₂′), while FIG. 2 shows the case of source schema evolution (the source schema S₁ changes to S₁′).

In FIG. 1 for target schema evolution, the first step after the mapping M₁₂ has been previously specified is to generate a schema mapping M_(22′) from S₂ to S₂′. The second step is to apply mapping composition for M₁₂ and M_(22′) to obtain a new schema mapping M_(12′). This new schema mapping M_(12′) can then be used to guide the generation of the new scripts, queries or programs that transform data from S₁ to the new schema S₂′.

In FIG. 2 for source schema evolution, the first step after the mapping M₁₂ has been previously specified is to generate a schema mapping M_(1′1) from S₁′ to S₁. The second step is to apply mapping composition for M_(1′1) and M₁₂ to obtain a new schema mapping M_(1′2). This new schema mapping M_(1′2) can then be used to guide the generation of the new scripts, queries or programs that transform data from the new schema S₁′ to S₂.

Both source schema evolution and target schema evolution may be implemented with respect to a transformation semantics.

1.2 Optimization of Data Transformation Flows

A data transformation flow is a sequence or, more generally, a directed graph, where the nodes of the graph are schemas and the edges are transformations or mappings between the schemas. Data transformation flows are the common abstraction for commercial ETL (extract-transform-load) systems. The methods of the current invention can be used to compose two schema mappings that appear sequentially in the data transformation flow and, thus, eliminate intermediate stages in the graph, with obvious performance benefits.

Furthermore, the repeated application of composition can yield an end-to-end schema mapping that eliminates all the intermediate nodes in the graph. This results in a high-level view of the entire data transformation flow; this high-level view, in turn, may allow for subsequent re-optimization of the data flow graph into a different, but equivalent, data flow graph, with better performance.

1.3 Mapping Deployment

Schema mappings may be designed at a logical level. More concretely, a schema mapping M_(ST) may be designed between two logical schemas S and T. At deployment time, however, such a mapping needs to be deployed (usually by a different person) between two physical schemas S′ and T′ that are not exactly the same as the logical schemas. Once mappings between the physical and the logical schemas are obtained (e.g., mappings M_(S′S) from S′ to S, and M_(TT′) from T to T′), the methods of the current invention can be applied to generate a physical schema mapping M_(S′T′) from S′ to T′ by composing M_(S′S) with the “logical” mapping M_(ST) and then with M_(TT′).

2. BASIC TERMS, CONCEPTS, AND NOTATION

2.1 Schemas and Instances

The schemas S₁, S₂, and S₃ can be any schema expressed in a language of nested elements of different types that include atomic types, record types, choice types and set types respectively corresponding to atomic elements, record elements, choice elements, and set elements. This language is called herein the nested relational schema language and can encode relational database schemas as well as XML and hierarchical schemas.

Atomic type (or primitive type) expressions are the usual data types: String, Int, Float, etc. Record types are used to encapsulate together groups of elements which in turn can be atomic type or complex type expressions. A complex type expression is a non-atomic type (e.g., a set, record, or choice type) expression. For example, RCD [ssn: Int, name: String] represents a record type whose components are ssn (social security number) and name. Another example is RCD [ssn: Int, person: RCD [name: String, address: String]], which denotes a record with two components: ssn, of integer type, and person, of record type (with two components: name and address, of string type). Set types are used to represent collections of elements or records. For example, SET of RCD [ssn: Int, person: RCD [name: String, address: String]] will represent a collection of records, where each record will denote a person. Choice types are used to represent elements that can be one of multiple components. For example, CHOICE [name: String, full_name: RCD [firstName: String, lastName: String]] will represent elements that can include either a name component (of type String), or a full_name component (which itself consists of firstName and lastName).
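The four kinds of type expression described above can be sketched as a small data model. The following Python sketch is illustrative only: the class names Atomic, Rcd, Choice, and SetOf and the helpers is_complex and show are assumptions introduced here, not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Atomic:
    name: str  # e.g. "String", "Int", "Char"


@dataclass(frozen=True)
class Rcd:
    fields: Tuple[Tuple[str, "Type"], ...]  # ordered (label, type) pairs


@dataclass(frozen=True)
class Choice:
    alternatives: Tuple[Tuple[str, "Type"], ...]  # one alternative is present


@dataclass(frozen=True)
class SetOf:
    element: "Type"


Type = Union[Atomic, Rcd, Choice, SetOf]


def is_complex(t):
    """A complex type expression is any non-atomic type expression."""
    return not isinstance(t, Atomic)


def show(t):
    """Render a type expression in the RCD / SET of / CHOICE notation."""
    if isinstance(t, Atomic):
        return t.name
    if isinstance(t, SetOf):
        return "SET of " + show(t.element)
    keyword = "RCD" if isinstance(t, Rcd) else "CHOICE"
    parts = t.fields if isinstance(t, Rcd) else t.alternatives
    body = ", ".join(f"{label}: {show(ty)}" for label, ty in parts)
    return f"{keyword} [{body}]"


person = Rcd((("name", Atomic("String")), ("address", Atomic("String"))))
people = SetOf(Rcd((("ssn", Atomic("Int")), ("person", person))))
print(show(people))
# SET of RCD [ssn: Int, person: RCD [name: String, address: String]]
```

Here the record type person is nested inside another record type, which is in turn nested inside a set type, so people is an example of one complex type expression nested inside another.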

A schema is a collection of roots, that is, names with associated nested types. The following are examples of nested relational schemas S₁, S₂, and S₃:

S₁ = {
  src₁: RCD [
    students: SET of RCD [sid: Int, name: String, course: String, grade: Char]
  ],
  src₂: RCD [
    students: SET of RCD [sid: Int, name: String, course_code: Int],
    courses: SET of RCD [course_code: Int, course: String, evaluation_file: String]
  ]
}

S₂ = {
  tgt: RCD [
    students: SET of RCD [
      sid: Int,
      name: String,
      courses: SET of RCD [course: String, eval_key: Int]
    ],
    evaluations: SET of RCD [eval_key: Int, grade: Char, evaluation_file: String]
  ]
}

S₃ = {
  new_tgt: RCD [
    students: SET of RCD [
      sid: Int,
      results: SET of RCD [eval_key: Int]
    ],
    evaluations: SET of RCD [eval_key: Int, grade: Char, evaluation_file: String]
  ]
}

In this example, the schemas represent information about students (student id or sid, name), courses they take, and the results (grade, evaluation_file) that are assigned to students for each course. The first schema S₁ contains two different roots, src₁ and src₂, which are complementary sources of information about students. The second schema S₂ is a reorganization of the first schema that merges the information under one root, tgt. The third schema S₃ has one root new_tgt and can be thought of as an evolution of the second schema S₂, where individual courses are no longer of interest but the results still are.

Given a schema, any element of set type that is not nested inside another set type (directly or indirectly) is referred to as a top-level set-type element. For example, in the schema S₂ above, “students” and “evaluations” are top-level set-type elements, while “courses” is not (since it is nested inside the set type associated with “students”).
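The definition of a top-level set-type element can be sketched as a traversal that stops descending as soon as a set type is reached. The encoding below is an assumption made for illustration: a Python dict stands for a record type, a one-element list stands for a set type, and a string stands for an atomic type; choice types are omitted. The function name top_level_sets is hypothetical.

```python
# Schema S2 in a simplified encoding (an assumption of this sketch):
# dict = RCD, one-element list = SET of, str = atomic type.
S2 = {
    "tgt": {
        "students": [{
            "sid": "Int",
            "name": "String",
            "courses": [{"course": "String", "eval_key": "Int"}],
        }],
        "evaluations": [{
            "eval_key": "Int",
            "grade": "Char",
            "evaluation_file": "String",
        }],
    }
}


def top_level_sets(schema):
    """Return the paths of set-type elements not nested inside another set type."""
    found = []

    def walk(type_expr, path):
        if isinstance(type_expr, list):
            # A set type was reached: record it and do not descend further,
            # because anything below it is nested inside this set type.
            found.append(path)
        elif isinstance(type_expr, dict):
            for label, component in type_expr.items():
                walk(component, f"{path}.{label}")

    for root, type_expr in schema.items():
        walk(type_expr, root)
    return found


print(top_level_sets(S2))  # ['tgt.students', 'tgt.evaluations']
```

The element “courses” is not reported because the traversal never looks inside the set type of “students”.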

Given a schema S with roots r₁ of type T₁, . . . , r_(n) of type T_(n), an instance I over S is a collection of values v₁, . . . , v_(n) for the roots r₁, . . . , r_(n) of S, such that for each k from 1 to n, v_(k) is of type T_(k). For example, FIG. 3 illustrates two instances over the schema S₂ shown supra, in accordance with embodiments of the present invention. In FIG. 3, the instances I and I′ appear on the left and right, respectively.

2.2 Schema Mappings

Schema mappings are used to specify relationships between schemas. A schema mapping between schema S₁ and schema S₂ specifies how a database or document instance conforming to S₁ relates to a database or document instance conforming to S₂. Additionally, a schema mapping can be seen as a specification of how a database or document instance conforming to schema S₁ can be transformed into a database or document instance conforming to schema S₂.

Schema mappings are expressed in a constraint language called second-order nested tuple-generating dependencies (or SO nested tgds). An SO nested tgd may be characterized by at least one schema of the schemas S₁ and S₂ comprising at least one complex type expression nested inside another complex type expression.

Each SO nested tgd includes one or more formulas. Each formula includes clauses such as: a for clause, an exists clause, a where clause, and a with clause. The for clause identifies source tuples to which the formula applies. The exists clause identifies tuples that must exist in the target. The where clause describes constraints on the tuples of source and/or target. The with clause describes how values in fields of source and target tuples are matched.

The following is an example of a schema mapping, M₁₂, between the schemas S₁ and S₂ that were illustrated before. M₁₂ includes two formulas, m₁ and m₂, each formula being an SO nested tgd.

m₁: for (s in src₁.students)
    exists (s′ in tgt.students) (c′ in s′.courses) (e′ in tgt.evaluations)
    where c′.eval_key = e′.eval_key
    with s.sid = s′.sid and s.name = s′.name and
         s.course = c′.course and s.grade = e′.grade

m₂: for (s in src₂.students) (c in src₂.courses)
    where s.course_code = c.course_code
    exists (s′ in tgt.students) (c′ in s′.courses) (e′ in tgt.evaluations)
    where c′.eval_key = e′.eval_key
    with s.sid = s′.sid and s.name = s′.name and
         c.course = c′.course and c.evaluation_file = e′.evaluation_file

The meaning of the above formulas for m₁ and m₂ is as follows. FIG. 4 also shows a pictorial view of how the above formulas for m₁ and m₂ relate elements in schema S₁ and elements in schema S₂, in accordance with embodiments of the present invention.

The formula m₁ is a constraint asserting, via the exists clause, that for each tuple s that appears in the set “students” under src₁ (of S₁), there exists a tuple s′ in the set “students” under tgt (of S₂), a tuple c′ in the set “courses” of s′, and a tuple e′ in the set “evaluations” under tgt.

Moreover, the where clause of m₁ specifies that the tuples c′ and e′ must not be arbitrary; they are constrained so that they have the same “eval_key” value.

Furthermore, the tuples s′, c′, and e′ are also constrained by the with clause of m₁: the “sid” and “name” fields of s′ must respectively equal (i.e., have the same value of) the “sid” and “name” fields of s, the “course” field of c′ must equal the “course” field of s, and the “grade” field of e′ must equal the “grade” field of s. The value of “evaluation_file” of e′ is left unspecified by m₁, due to the fact that src₁ does not contain any element that corresponds to an evaluation file.

In general, the with clause of a formula such as m₁ will contain a sequence of equalities relating elements in a source schema such as S₁ with elements (not necessarily with the same name) in a target schema such as S₂. In the case of m₁, these equalities are shown pictorially in FIG. 4 as the set of arrows grouped under the name “m₁”.
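The reading of m₁ as a constraint on a pair of instances can be sketched as a direct check: for every source tuple s, some combination of target tuples s′, c′, and e′ must satisfy the where and with clauses. The nested dict and list encoding of instances, the ASCII root name "src1", the sample data, and the function name satisfies_m1 are assumptions of this sketch.

```python
def satisfies_m1(I1, I2):
    """Check whether the pair (I1, I2) satisfies formula m1."""
    for s in I1["src1"]["students"]:          # for (s in src1.students)
        witness_exists = any(
            c2["eval_key"] == e2["eval_key"]  # where c'.eval_key = e'.eval_key
            and s["sid"] == s2["sid"]         # with clause equalities
            and s["name"] == s2["name"]
            and s["course"] == c2["course"]
            and s["grade"] == e2["grade"]
            for s2 in I2["tgt"]["students"]   # exists (s' in tgt.students)
            for c2 in s2["courses"]           #        (c' in s'.courses)
            for e2 in I2["tgt"]["evaluations"]  #      (e' in tgt.evaluations)
        )
        if not witness_exists:
            return False
    return True


# Hypothetical sample data, not taken from the figures.
I1 = {"src1": {"students": [
    {"sid": "001", "name": "Mary", "course": "CS120", "grade": "A"},
]}}

I2_good = {"tgt": {
    "students": [{"sid": "001", "name": "Mary",
                  "courses": [{"course": "CS120", "eval_key": "k1"}]}],
    "evaluations": [{"eval_key": "k1", "grade": "A",
                     "evaluation_file": "file01"}],
}}

# Same as I2_good, except that the evaluation carries a different key.
I2_bad = {"tgt": {
    "students": I2_good["tgt"]["students"],
    "evaluations": [{"eval_key": "k2", "grade": "A",
                     "evaluation_file": "file01"}],
}}

print(satisfies_m1(I1, I2_good))  # True
print(satisfies_m1(I1, I2_bad))   # False
```

The value of evaluation_file plays no role in the check, which mirrors the fact that m₁ leaves it unspecified.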

The formula m₂ is a similar constraint asserting, via the exists clause, that for each tuple s that appears in the set “students” under src₂ (of S₁) and for each tuple c that appears in the set “courses” under src₂, where s and c satisfy the condition that they have the same “course_code” value (as stated in the first where clause of m₂), there must exist a tuple s′ in the set “students” under tgt (of S₂), a tuple c′ in the set “courses” of s′, and a tuple e′ in the set “evaluations” under tgt.

As in m₁, the second where clause of m₂ specifies that the tuples c′ and e′ must not be arbitrary; they are constrained so that they have the same “eval_key” value. Furthermore, the tuples s′, c′, and e′ are also constrained by the with clause of m₂: the “sid” and “name” fields of s′ must equal the “sid” and “name” fields of s, the “course” field of c′ must equal the “course” field of c, and the “evaluation_file” field of e′ must equal the “evaluation_file” field of c. The “grade” of e′ is left unspecified by m₂, due to the fact that src₂ does not contain any element that corresponds to a grade.

As a notational matter, clauses appearing in a mapping such as M₁₂ are underlined (e.g., for clause, exists clause, where clause, with clause) and may alternatively be denoted in upper case letters (e.g., FOR clause, EXISTS clause, WHERE clause, WITH clause, respectively).

2.3 Mapping Language: General Syntax of SO Nested tgds

In general, given a schema S₁ (the source schema) and a schema S₂ (the target schema), a nested tgd (m) is a formula of the form:

m: for (x₁ in g₁^(s)) . . . (x_(n) in g_(n)^(s))
   where B₁(x₁, . . . , x_(n))
   exists (y₁ in g₁^(t)) . . . (y_(m) in g_(m)^(t))
   where B₂(y₁, . . . , y_(m))
   with (e₁^(s) = e₁^(t)) and . . . and (e_(k)^(s) = e_(k)^(t))

where:

1) x₁, . . . , x_(n), y₁, . . . , y_(m) are variables (other symbols for variables are x, y, z, z₁, etc.).

2) e₁^(s), . . . , e_(k)^(s), e₁^(t), . . . , e_(k)^(t) are expressions, where in general expressions are defined by the following grammar: e ::= x | r | e.A (i.e., an expression can be a variable, a schema root, or a record component of another expression). In the above mapping, e₁^(s), . . . , e_(k)^(s) are source expressions, that is, they are required to use only variables from the for clause (i.e., x₁, . . . , x_(n)) and schema roots from the source schema S₁. Moreover, e₁^(t), . . . , e_(k)^(t) are target expressions, that is, they are required to use only variables from the exists clause (i.e., y₁, . . . , y_(m)) and schema roots from the target schema S₂. Furthermore, all the expressions that appear in the with clause must be of atomic type.

3) g₁^(s), . . . , g_(n)^(s), g₁^(t), . . . , g_(m)^(t) are generators, where in general a generator is defined by the following grammar: g ::= e₁ | case e₂ of A (where e₁ is an expression of set type and e₂ is an expression of a choice type that must include the choice of an A component). In the above mapping, g₁^(s), . . . , g_(n)^(s) are source generators, that is, they are required to use only variables from the for clause (i.e., x₁, . . . , x_(n)) and schema roots from the source schema S₁. Moreover, g₁^(t), . . . , g_(m)^(t) are target generators, that is, they are required to use only variables from the exists clause (i.e., y₁, . . . , y_(m)) and schema roots from the target schema S₂. Furthermore, for every i from 1 to n, the ith source generator in the for clause can only use variables x₁, . . . , x_(i-1). Similarly, for every j from 1 to m, the jth target generator in the exists clause can only use variables y₁, . . . , y_(j-1).

4) B₁(x₁, . . . , x_(n)) and B₂(y₁, . . . , y_(m)) are predicates of the form (e₁ = e₁′) and . . . and (e_(l) = e_(l)′) where e₁, e₁′, . . . , e_(l), e_(l)′ are expressions of atomic type. In the case of B₁(x₁, . . . , x_(n)), which is called a source predicate, these expressions can only be source expressions, while in the case of B₂(y₁, . . . , y_(m)), which is called a target predicate, these expressions can only be target expressions.

The earlier formulas m₁ and m₂ are examples of nested tgds over the schema S₁ and schema S₂. Second-order nested tgds (or, SO nested tgds, in short) are defined as an extension of nested tgds in the following way.

Each source expression that can appear in B₁ and each source expression e_(i)^(s) that can appear in the with clause can now be a function term t, defined by the grammar t ::= e | F(t) where e is an expression as before (over the source variables x₁, . . . , x_(n)) while F is a function name (out of a possibly infinite set of function names that are available). It is possible that one function name is shared among multiple SO nested tgds.

The function F is sometimes called a Skolem function and a corresponding function term F(t) is sometimes called a Skolem term. One reason for this terminology is that the subsequent Algorithm 1, described infra, includes a step (step 31) that generates such function terms (Skolem terms) based on a procedure called Skolemization.

As an example, FIG. 5 depicts schemas S and T to illustrate a Skolem function, in accordance with embodiments of the present invention. The following set M_(ST) is a schema mapping comprising SO nested tgds over the schemas S and T that both use the Skolem function F:

M_(ST) = {
  for (t in Takes)
  exists (s in Student)
  with s.sid = F(t.name) and s.name = t.name

  for (t in Takes)
  exists (e in Enrolls)
  with e.sid = F(t.name) and e.course = t.course
}

The source schema S includes a Takes table storing student names and courses the students take. The schema T includes two separate tables: a Student table, storing student ids and student names, and an Enrolls table relating student ids and courses. The mapping M_(ST) splits a tuple (name, course) into two tuples of Student and Enrolls, one containing the name and the other containing the course. At the same time, a Skolem function F is used to assign an “unknown” student id for each student name. The Skolem function F is used consistently (i.e., having the name as parameter) in both formulas to express the fact that a given student name is assigned the same student id in both tables.
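The role of the shared Skolem function F can be sketched by modelling F as a memoized generator of fresh identifiers, so that equal arguments always yield the same value. This is a minimal illustration under stated assumptions: Takes is encoded as a list of (name, course) pairs, the identifiers "id1", "id2", and so on are placeholders for the “unknown” student ids, and the names make_skolem and apply_mst are hypothetical.

```python
import itertools


def make_skolem(prefix):
    """Return a function that maps each distinct argument tuple to one fresh value."""
    counter = itertools.count(1)
    table = {}

    def skolem(*args):
        if args not in table:
            table[args] = f"{prefix}{next(counter)}"
        return table[args]

    return skolem


def apply_mst(takes):
    """Populate Student and Enrolls from Takes, following the two formulas of M_ST."""
    F = make_skolem("id")  # the same F is shared by both formulas
    student = {(F(name), name) for name, _course in takes}
    enrolls = {(F(name), course) for name, course in takes}
    return student, enrolls


takes = [("Mary", "CS120"), ("Mary", "CS200"), ("John", "CS500")]
student, enrolls = apply_mst(takes)
print(sorted(student))  # [('id1', 'Mary'), ('id2', 'John')]
print(sorted(enrolls))  # [('id1', 'CS120'), ('id1', 'CS200'), ('id2', 'CS500')]
```

Because F is applied to the name in both formulas, Mary receives the same id in Student and in both of her Enrolls tuples. Using two independent functions would lose that link.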

2.4 Two Semantics of Mapping Composition

There are two semantics that can be associated with schema mappings and with their composition. The first semantics, the relationship semantics, is more general, while the second semantics, the transformation semantics, is more suitable for certain specialized tasks, such as schema evolution and optimization of data transformation flows, and also generates simpler formulas.

2.4.1 Relationship Semantics

Schema mappings can be viewed as describing relationships between instances over two schemas. Under this view, the formulas that appear in a schema mapping are considered as inter-schema constraints. More concretely, given a schema mapping M₁₂ between a schema S₁ and a schema S₂, one can define the set

Rel(M₁₂) = {(I₁, I₂) | I₁ is an instance over S₁, I₂ is an instance over S₂, and (I₁, I₂) satisfies M₁₂}  (1)

This set Rel(M₁₂), called the binary relation of M₁₂ or the relationship induced by M₁₂, contains all the “valid” pairs of instances (I₁, I₂), where “valid” means pairs of instances (I₁, I₂) that satisfy all the constraints that appear in M₁₂. Given two schema mappings M₁₂, from schema S₁ to schema S₂, and M₂₃, from schema S₂ to another schema S₃, the composition Rel(M₁₂) ∘ Rel(M₂₃) of their induced relationships is defined as the composition of the two binary relations Rel(M₁₂) and Rel(M₂₃):

Rel(M₁₂) ∘ Rel(M₂₃) = {(I₁, I₃) | I₁ is an instance over S₁, I₃ is an instance over S₃, and there is an instance I₂ over S₂ such that (I₁, I₂) satisfies M₁₂ and (I₂, I₃) satisfies M₂₃}  (2)

Then, by definition, a schema mapping M₁₃ defines the composition of schema mappings M₁₂ and M₂₃, under the relationship semantics, if the following equation is satisfied:

Rel(M₁₃) = Rel(M₁₂) ∘ Rel(M₂₃)  (3)
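Equation (2) is the ordinary composition of binary relations. The sketch below illustrates it on small finite relations whose elements are opaque labels standing in for instances. This is a simplifying assumption: Rel(M₁₂) and Rel(M₂₃) are in general infinite sets of instance pairs and cannot be enumerated this way. The function name compose_relations is hypothetical.

```python
def compose_relations(rel12, rel23):
    """Return {(i1, i3) | there is an i2 with (i1, i2) in rel12 and (i2, i3) in rel23}."""
    return {
        (i1, i3)
        for (i1, i2) in rel12
        for (j2, i3) in rel23
        if i2 == j2  # the intermediate instance must be the same in both pairs
    }


# Labels such as "a" or "x" stand in for whole instances over S1, S2, S3.
rel12 = {("a", "x"), ("a", "y"), ("b", "y")}
rel23 = {("x", "p"), ("y", "q")}

rel13 = compose_relations(rel12, rel23)
print(sorted(rel13))  # [('a', 'p'), ('a', 'q'), ('b', 'q')]
```

The intermediate labels "x" and "y" do not appear in the result, which corresponds to the requirement that M₁₃ does not refer to schema S₂.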

The above definition is for schemas that are non-nested and for schema mappings that are between non-nested schemas. (A non-nested schema is a schema as defined earlier, with the restriction that there are no set-type elements that are nested within other set-type elements. Schemas of relational databases are good examples of non-nested schemas.)

In general, for schemas that are nested (that is, can have set-type elements that are nested within other set-type elements) the above definition is slightly modified by requiring all instances (e.g., I₁, I₂, I₃) that appear in equations (1) and (2) to be in a normal form that is called partitioned normal form (or PNF).

An instance is said to be in the partitioned normal form (PNF) if there cannot be two records in the same set (where the set can occur in any place in the instance) such that the two records have the same atomic sub-components, component-wise, but different set-valued sub-components.

For example, the instance I on the left in FIG. 3 is not in PNF, because there are three records that have the same atomic components (001 and Mary for the “sid” and “name” components, respectively) but different sets of courses. In contrast, the instance I′ on the right in FIG. 3 is in PNF (intuitively, all the courses for 001 and Mary have been merged under a single set).

PNF is a goodness criterion for instances; this criterion requires a natural form of data merging (grouping) to be satisfied by instances. As an important special case, all non-nested instances are automatically in PNF.
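The PNF condition can be sketched as a recursive check over an instance. The encoding is an assumption of this sketch: a record is a Python dict, a set is a list of records, and any other value is atomic. The sample data is hypothetical and is not the content of FIG. 3; the function names are likewise hypothetical.

```python
def atomic_key(record):
    """The atomic sub-components of a record, as a hashable key."""
    return tuple(sorted(
        (label, value) for label, value in record.items()
        if not isinstance(value, list)
    ))


def canonical(value):
    """Hashable, order-insensitive form of a value, used to compare sets."""
    if isinstance(value, list):
        return frozenset(canonical(rec) for rec in value)
    if isinstance(value, dict):
        return tuple(sorted((k, canonical(v)) for k, v in value.items()))
    return value


def is_pnf(value):
    """True if no set holds two records with equal atomic parts but different set parts."""
    if isinstance(value, dict):
        return all(is_pnf(v) for v in value.values())
    if isinstance(value, list):
        seen = {}
        for rec in value:
            key = atomic_key(rec)
            set_parts = canonical(
                {k: v for k, v in rec.items() if isinstance(v, list)})
            if key in seen and seen[key] != set_parts:
                return False
            seen[key] = set_parts
        return all(is_pnf(rec) for rec in value)
    return True  # atomic values impose no condition


split = {"tgt": {"students": [
    {"sid": "001", "name": "Mary", "courses": [{"course": "CS120"}]},
    {"sid": "001", "name": "Mary", "courses": [{"course": "CS200"}]},
]}}
merged = {"tgt": {"students": [
    {"sid": "001", "name": "Mary",
     "courses": [{"course": "CS120"}, {"course": "CS200"}]},
]}}

print(is_pnf(split))   # False
print(is_pnf(merged))  # True
```

A non-nested instance contains no set-valued sub-components, so the comparison of set parts can never fail, in line with the special case noted above.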

2.4.2 Transformation Semantics

A schema mapping can be viewed as describing a process (data movement) in which a target instance is materialized, given a source instance I and the schema mapping. More concretely, a schema mapping M₁₂ from schema S₁ to schema S₂ defines a function that, given an instance I₁ over S₁, computes an instance I₂ over S₂:

I₂ = M₁₂(I₁)  (4)

The formal definition of the function is as follows. First, all the formulas that appear in M₁₂ are Skolemized, by applying the procedure described in detail infra for step 41 of Algorithm 2 (see FIG. 12 described infra). Let the set of resulting Skolemized formulas from step 31 of Algorithm 1 be M′₁₂. For the earlier example M₁₂ = {m₁, m₂}, the Skolemized set M′₁₂ includes the formulas m₁′ and m₂′ shown in the description of step 41 of Algorithm 2.

In the second step, the chase with SO nested tgds is used to construct a target instance I₂′ based on I₁, using the mapping M′₁₂. At the beginning of the chase, the target instance I₂′ is empty. Then, for each SO nested tgd m′ in M′₁₂, and for each binding of the for clause of m′ to tuples in the source instance I₁ such that the first where clause is satisfied, the chase adds tuples to the target instance I₂′ such that the exists clause of m′, its associated where clause and the with clause are satisfied.

As an example, FIG. 6 depicts a mapping that illustrates the chase process for constructing a target instance, in accordance with embodiments of the present invention. In particular, FIG. 6 depicts a source instance I₁ over the schema S₁ illustrated earlier. The source instance I₁ contains two “student” tuples under the root “src₁”: [001, Mary, CS120, A] for Mary and [005, John, CS500, B] for John. For simplicity, FIG. 6 does not show the labels that are associated with the preceding values; i.e., the labels “sid”, “name”, “course”, and “grade”. These labels are shown instead in the schema S₁, which is illustrated next to the instance. Furthermore, the instance I₁ contains two more tuples about Mary in the “students” set under the root “src₂” (i.e., the tuples [001, Mary, K7] and [001, Mary, K4]), and two more tuples implicitly about Mary in the “courses” set under the root “src₂” (i.e., the tuples [K7, CS120, file01] and [K4, CS200, file07]).

For this example, the chase works as follows. The for clause of m₁′ can be instantiated to the first tuple [001, Mary, CS120, A] in the “students” set of “src₁”. Then, to satisfy the exists clause, its associated where clause as well as the with clause of m₁′, three tuples are added. First, a tuple [001, Mary, s₁] is added to the “students” set of the root “tgt”. The value 001 is the “sid” component of this new tuple, Mary is the “name” component, and s₁ denotes the set value (initially empty) for the “courses” component. Then, a second tuple [CS120, E₁(001, Mary, CS120, A)] is added to the set s₁ of the previous tuple. Here, E₁(001, Mary, CS120, A) is a ground Skolem term obtained by applying the Skolem function E₁ to the concrete values (001, Mary, CS120, A). This Skolem function and its arguments are specified by the formula m₁′ which is the result of the Skolemization step mentioned earlier. Finally, a third tuple [E₁(001, Mary, CS120, A), A, F(001, Mary, CS120, A)] is added to the “evaluations” set under the root “tgt”. Here, E₁(001, Mary, CS120, A) is the same ground Skolem term created before, while F(001, Mary, CS120, A) is a new ground Skolem term obtained by applying the Skolem function F to the concrete values (001, Mary, CS120, A). Again, this Skolem function and its arguments are specified by the formula m₁′ which is the result of the Skolemization step mentioned earlier.

The above-described process is repeated for all the tuples in the source instance I₁ and for all the formulas in M₁₂′ (e.g., for m₂′ in addition to m₁′). At the end, each distinct ground function term is replaced by a unique value throughout I₂. For example, the two occurrences of E₁(001, Mary, CS120, A) are replaced by a value E₁ that is generated so that it is different from every other value. A different ground function term, such as F(001, Mary, CS120, A), is replaced by a different value (e.g., F₁). The instance I₂′ depicted in FIG. 6 at the right of schema S₂ is the result of applying the chase with the SO nested tgds in M₁₂′.
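
The chase step for m₁′ described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the claimed implementation: the tuple encoding of ground Skolem terms, the helper names (skolem, chase_m1, ground), and the generated value names such as E1_1 are assumptions introduced here; only the source tuples and the Skolem functions E₁ and F come from the example.

```python
from itertools import count

# Source "students" tuples under src1, as in FIG. 6: (sid, name, course, grade).
src1_students = [
    ("001", "Mary", "CS120", "A"),
    ("005", "John", "CS500", "B"),
]

def skolem(fn, *args):
    """Build a ground Skolem term as a hashable tagged tuple (assumed encoding)."""
    return ("skolem", fn, args)

def chase_m1(students):
    """Add the three target tuples that m1' requires for each source tuple."""
    tgt_students, tgt_evaluations = [], []
    for sid, name, course, grade in students:
        key = skolem("E1", sid, name, course, grade)
        file_term = skolem("F", sid, name, course, grade)
        tgt_students.append((sid, name, [(course, key)]))   # nested "courses" set
        tgt_evaluations.append((key, grade, file_term))
    return tgt_students, tgt_evaluations

def ground(value, names, counters):
    """Replace each distinct ground Skolem term by one fresh value, consistently."""
    if isinstance(value, tuple) and value and value[0] == "skolem":
        if value not in names:
            fn = value[1]
            names[value] = f"{fn}_{next(counters.setdefault(fn, count(1)))}"
        return names[value]
    if isinstance(value, (tuple, list)):
        return type(value)(ground(v, names, counters) for v in value)
    return value

names, counters = {}, {}
students, evaluations = chase_m1(src1_students)
students = ground(students, names, counters)
evaluations = ground(evaluations, names, counters)
print(students[0])      # ('001', 'Mary', [('CS120', 'E1_1')])
print(evaluations[0])   # ('E1_1', 'A', 'F_1')
```

Because both occurrences of the same ground term map to the same generated value, the eval_key in Mary's course record still matches the eval_key of her evaluation record after the replacement.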

After the chase finishes, the resulting target instance I₂′ is further transformed (PNF-ized) into an instance I₂ that is in the partitioned normal form (PNF) described earlier. The PNF-ization identifies all records that appear in the same set (where the set can occur in any place in the instance) such that the records have the same atomic sub-components, component-wise, but different set-valued sub-components. For all such records, the set-valued components are unioned together. The process continues recursively until no such records can be found.

For the previous example, the instance I₂′ in FIG. 6 is not in PNF. It contains three tuples that have the same atomic components (001, Mary) but different sets: s₁, s₃ and s₄. To PNF-ize the instance, the three tuples are merged into one tuple, whose set is the union of s₁, s₃ and s₄. FIG. 7 illustrates PNF-izing the instance I₂′ into the final instance I₂, in accordance with embodiments of the present invention. The instance I₂ is M₁₂(I₁).
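
The merging performed by PNF-ization can be illustrated with the following Python sketch. It assumes a hypothetical encoding in which a record is a dict, set-valued components are lists of records, and every other component is atomic; the eval_key values K1 to K4 are placeholders, not values taken from FIG. 6.

```python
def pnf(records):
    """Merge records of one set that agree on all of their atomic components."""
    merged = {}
    for rec in records:
        # The atomic components identify the record; set-valued ones are unioned.
        atomic = tuple(sorted((k, v) for k, v in rec.items()
                              if not isinstance(v, list)))
        slot = merged.setdefault(atomic, {})
        for k, v in rec.items():
            if isinstance(v, list):
                slot.setdefault(k, []).extend(v)
    result = []
    for atomic, sets in merged.items():
        rec = dict(atomic)
        for k, members in sets.items():
            rec[k] = pnf(members)   # recurse: a unioned set may need merging too
        result.append(rec)
    return result

# Three records for (001, Mary) with different "courses" sets, plus one for John.
students = [
    {"sid": "001", "name": "Mary", "courses": [{"course": "CS120", "eval_key": "K1"}]},
    {"sid": "001", "name": "Mary", "courses": [{"course": "CS120", "eval_key": "K3"}]},
    {"sid": "001", "name": "Mary", "courses": [{"course": "CS200", "eval_key": "K4"}]},
    {"sid": "005", "name": "John", "courses": [{"course": "CS500", "eval_key": "K2"}]},
]
merged_students = pnf(students)
mary = merged_students[0]
print(len(merged_students), len(mary["courses"]))   # 2 3
```

The three Mary records collapse into one record whose "courses" set is the union of the three original sets, which mirrors the merge of s₁, s₃ and s₄ described above.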

Then, by definition, a schema mapping M₁₃ defines the composition of schema mappings M₁₂ and M₂₃, with respect to the transformation semantics, if the following equation is satisfied:
M₁₃(I₁)=M₂₃(M₁₂(I₁)), for every instance I₁ over S₁  (5)

With respect to schema evolution (see Section 1.1), it is noted that for target schema evolution the mapping M₁₂ is provided or specified before the mapping M₂₃ is provided or specified, whereas for source schema evolution the mapping M₁₂ is provided or specified after the mapping M₂₃ is provided or specified.

3. Algorithm 1: Composition with Respect to Relationship Semantics

The input to Algorithm 1 is as follows. Schema mappings M₁₂ and M₂₃ are inputs expressed as constraints in the language of second-order nested tuple-generating dependencies (or SO nested tgds). The schema mapping M₁₂ is a set of SO nested tgds that relate a schema S₁ and a schema S₂. Similarly, schema mapping M₂₃ is a set of SO nested tgds that relate a schema S₂ to a schema S₃.

FIG. 8 illustrates a sequence of schema mappings M₁₂ and M₂₃ as an example of input to Algorithm 1, in accordance with embodiments of the present invention. M₁₂ is a mapping from S₁ to S₂, and M₂₃ is a mapping from S₂ to S₃, where S₁, S₂, and S₃ are the three schemas illustrated earlier. M₁₂ is the set {m₁, m₂} of formulas described earlier, and M₂₃ comprises the following formula m₃:
m₃: for (s in tgt.students) (c in s.courses) (e in tgt.evaluations)
  where c.eval_key=e.eval_key
  exists (s′ in new_tgt.students) (r′ in s′.results) (e′ in new_tgt.evaluations)
  where r′.eval_key=e′.eval_key
  with s.sid=s′.sid and e.eval_key=e′.eval_key and
  e.grade=e′.grade and e.evaluation_file=e′.evaluation_file

The output of Algorithm 1 is as follows. Schema mapping M₁₃, comprising SO nested tgds that relate schema S₁ and schema S₃, is an output representing the composition of M₁₂ and M₂₃ under the relationship semantics.

FIGS. 9A-9C (collectively, “FIG. 9”) are flow charts depicting steps 31-34 of Algorithm 1 for determining a composition of schema mappings with respect to relationship semantics, in accordance with embodiments of the present invention. In the discussion of FIG. 9, the term “target element” refers to an element of schema S₂.

In step 31, each SO nested tgd m of M₁₂ is Skolemized by assigning a Skolem function term to each target atomic element whose value is not determined by formula m. A target atomic element X is said to be not determined by formula m if X is an atomic component of a tuple of S₂ that is asserted in the exists clause of formula m but X does not appear in any equality with a source atomic element in the with clause of formula m.

For example, for the formula m₁ of the earlier schema mapping M₁₂, the target atomic element e′.evaluation_file, which is a component of the tuple e′ asserted in the exists clause of m₁, is not determined by m₁, since e′.evaluation_file does not appear in any equality in the with clause of m₁. As additional examples, for the same formula m₁, the target atomic elements c′.eval_key and e′.eval_key are components of the tuples c′ and e′ in the exists clause of m₁, but are not determined by m₁ (i.e., there is an equality that relates c′.eval_key and e′.eval_key in the where clause of m₁, but not in the with clause of m₁).

For each formula m, Skolemization adds an equality in the with clause of m for each target atomic element that is not determined by m. This additional equality equates the target atomic element with a Skolem term that is constructed by creating a new function symbol and applying it to a list of arguments consisting of all the source atomic elements that appear in the with clause of m. The resulting formula is another SO nested tgd m′. If two target atomic elements are constrained to be equal by the where clause of m, then the same Skolem term (i.e., the same function symbol) will be used in m′.

As an example of Skolemization, the formulas m₁ and m₂ of the earlier schema mapping M₁₂ are respectively transformed into the following formulas m₁′ and m₂′ (which are also SO nested tgds). The added equalities in the with clauses of m₁′ and m₂′ are those that involve the Skolem functions E₁, F, E₂, and G.
m₁′: for (s in src₁.students)
  exists (s′ in tgt.students) (c′ in s′.courses) (e′ in tgt.evaluations)
  where c′.eval_key=e′.eval_key
  with s.sid=s′.sid and s.name=s′.name and
  s.course=c′.course and s.grade=e′.grade and
  E₁(s.sid, s.name, s.course, s.grade)=c′.eval_key and
  E₁(s.sid, s.name, s.course, s.grade)=e′.eval_key and
  F(s.sid, s.name, s.course, s.grade)=e′.evaluation_file
m₂′: for (s in src₂.students) (c in src₂.courses)
  where s.course_code=c.course_code
  exists (s′ in tgt.students) (c′ in s′.courses) (e′ in tgt.evaluations)
  where c′.eval_key=e′.eval_key
  with s.sid=s′.sid and s.name=s′.name and
  c.course=c′.course and c.evaluation_file=e′.evaluation_file and
  E₂(s.sid, s.name, c.course, c.evaluation_file)=c′.eval_key and
  E₂(s.sid, s.name, c.course, c.evaluation_file)=e′.eval_key and
  G(s.sid, s.name, c.course, c.evaluation_file)=e′.grade

Note that for both m₁′ and m₂′, since the eval_key component of c′ is constrained to be equal to the eval_key of e′ (in the original formulas), the same Skolem term is being used for the two components (E₁(s.sid, s.name, s.course, s.grade) in m₁′, and E₂(s.sid, s.name, c.course, c.evaluation_file) in m₂′). On the other hand, the evaluation_file component of e′ and the grade component of e′ are not equal to anything in the where clause of m₁′ and m₂′, respectively; therefore evaluation_file and grade are used in unique Skolem terms F(s.sid, s.name, s.course, s.grade) and G(s.sid, s.name, c.course, c.evaluation_file), respectively.
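
The Skolemization of step 31 can be sketched as follows for the formula m₁. The sketch is illustrative only: the function skolemize, its parameters, and the string encoding of atomic elements are assumptions made here, and the sketch covers only the case in which the atoms equated by the where clause are all undetermined.

```python
def skolemize(target_atoms, with_eqs, where_eqs, fresh_names):
    """Return the equalities that Skolemization adds to the with clause.

    target_atoms: atomic components asserted in the exists clause
    with_eqs:     dict mapping a target atom to the source atom it equals
    where_eqs:    pairs of target atoms equated by the where clause
    fresh_names:  new function symbols, used in order of first need
    """
    # Arguments of every Skolem term: all source atoms of the with clause.
    source_args = tuple(dict.fromkeys(with_eqs.values()))

    # Group target atoms that the where clause constrains to be equal.
    group = {a: a for a in target_atoms}
    def find(a):
        while group[a] != a:
            a = group[a]
        return a
    for a, b in where_eqs:
        group[find(a)] = find(b)

    names, symbol, added = iter(fresh_names), {}, {}
    for atom in target_atoms:
        if atom in with_eqs:
            continue                      # determined by the with clause
        root = find(atom)
        if root not in symbol:
            symbol[root] = next(names)    # one function symbol per group
        added[atom] = (symbol[root],) + source_args
    return added

added = skolemize(
    target_atoms=["s'.sid", "s'.name", "c'.course", "c'.eval_key",
                  "e'.eval_key", "e'.grade", "e'.evaluation_file"],
    with_eqs={"s'.sid": "s.sid", "s'.name": "s.name",
              "c'.course": "s.course", "e'.grade": "s.grade"},
    where_eqs=[("c'.eval_key", "e'.eval_key")],
    fresh_names=["E1", "F"],
)
for atom, term in added.items():
    print(term, "=", atom)
```

For m₁ this yields the same term E1(s.sid, s.name, s.course, s.grade) for c′.eval_key and e′.eval_key, and a separate term with function symbol F for e′.evaluation_file, in agreement with m₁′.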

In step 32, for each target element E in schema S₂ that is of set type, a rule R_(E) is computed such that R_(E) describes how data instances for the element E can be created based on the Skolemized SO nested tgds relevant for E obtained in step 31. Each such rule R_(E) includes a union of a plurality of query terms that creates data instances for the element E based on all the Skolemized SO nested tgds that are relevant for E (i.e., each Skolemized SO nested tgd recites E in its exists clause).

For example, the element “students” in the earlier schema S₂ is of a set type. Moreover, among the two Skolemized SO nested tgds, m₁′ and m₂′, that are generated in step 31, both m₁′ and m₂′ are relevant for “students” since “students” is recited in the exists clause of both m₁′ and m₂′. Hence a rule that includes a union of two query terms is generated for “students”:
R_(students)=
  for (s in src₁.students)
  return [sid=s.sid, name=s.name, courses=R_(courses)(s.sid, s.name)]
  ∪
  for (s in src₂.students) (c in src₂.courses)
  where s.course_code=c.course_code
  return [sid=s.sid, name=s.name, courses=R_(courses)(s.sid, s.name)]

Each query term joined by the union operator ∪ in R_(students) includes a for clause that is the same as the for clause in the corresponding SO nested tgd. Furthermore, if the SO nested tgd includes a where clause immediately following the for clause, then this where clause is also included in the query term. In addition, each query term has a return clause that specifies how to construct the atomic components of a student record, based on the with clause of the corresponding SO nested tgd. For example, in the first query term, the sid component is to be constructed by taking the value of s.sid, where s represents a student record in the source. The correct expressions (e.g., s.sid) are decided based on the with clause of the corresponding Skolemized SO nested tgd, which specifies what the values of the target atomic components should be.

In addition, in the return clause, each component that is of set type is constructed by invoking a rule that is similar to the above rule R_(students) but is parameterized by the values of the atomic components of the record. For example, the “courses” component is constructed by invoking a rule R_(courses) with parameters s.sid and s.name. Such invocation constructs one set of course records for each different combination of student id and student name.

The rule for “courses”, given the above two Skolemized SO nested tgds, is shown below. Again, as before, both m₁′ and m₂′ are relevant (their exists clauses both contain “courses”). Therefore, two query terms appear in the union. Furthermore, each query term has a filtering condition in its where clause so that it generates course records only for students whose sid and name values match the values of the parameters (l₁ and l₂).
R_(courses)(l₁, l₂)=
  for (s in src₁.students)
  where s.sid=l₁ and s.name=l₂
  return [course=s.course, eval_key=E₁(s.sid, s.name, s.course, s.grade)]
  ∪
  for (s in src₂.students) (c in src₂.courses)
  where s.course_code=c.course_code and s.sid=l₁ and s.name=l₂
  return [course=c.course, eval_key=E₂(s.sid, s.name, c.course, c.evaluation_file)]

In general, such a parameterized rule is constructed for every element of set type that is nested inside another set type. Rules for top-level set-type elements do not need to be parameterized. As another example of such a top-level rule, the rule for “evaluations” is listed below.
R_(evaluations)=
  for (s in src₁.students)
  return [eval_key=E₁(s.sid, s.name, s.course, s.grade),
  grade=s.grade,
  evaluation_file=F(s.sid, s.name, s.course, s.grade)]
  ∪
  for (s in src₂.students) (c in src₂.courses)
  where s.course_code=c.course_code
  return [eval_key=E₂(s.sid, s.name, c.course, c.evaluation_file),
  grade=G(s.sid, s.name, c.course, c.evaluation_file),
  evaluation_file=c.evaluation_file]
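
To make the shape of these rules concrete, the following Python sketch evaluates a top-level rule and a parameterized rule over the source instance of FIG. 6. The function names r_students and r_courses, the dict encoding of records, and the tuple encoding of Skolem terms are assumptions made for illustration; lists are used in place of sets, so duplicate records are kept.

```python
src1_students = [
    {"sid": "001", "name": "Mary", "course": "CS120", "grade": "A"},
    {"sid": "005", "name": "John", "course": "CS500", "grade": "B"},
]
src2_students = [
    {"sid": "001", "name": "Mary", "course_code": "K7"},
    {"sid": "001", "name": "Mary", "course_code": "K4"},
]
src2_courses = [
    {"course_code": "K7", "course": "CS120", "evaluation_file": "file01"},
    {"course_code": "K4", "course": "CS200", "evaluation_file": "file07"},
]

def sk(fn, *args):
    """Ground Skolem term as a tuple (assumed encoding)."""
    return (fn,) + args

def r_courses(l1, l2):
    """Parameterized rule: one query term per relevant Skolemized tgd."""
    first = [
        {"course": s["course"],
         "eval_key": sk("E1", s["sid"], s["name"], s["course"], s["grade"])}
        for s in src1_students
        if s["sid"] == l1 and s["name"] == l2
    ]
    second = [
        {"course": c["course"],
         "eval_key": sk("E2", s["sid"], s["name"], c["course"], c["evaluation_file"])}
        for s in src2_students for c in src2_courses
        if s["course_code"] == c["course_code"] and s["sid"] == l1 and s["name"] == l2
    ]
    return first + second          # the union of the two query terms

def r_students():
    """Top-level rule: no parameters; invokes r_courses for the nested set."""
    first = [
        {"sid": s["sid"], "name": s["name"], "courses": r_courses(s["sid"], s["name"])}
        for s in src1_students
    ]
    second = [
        {"sid": s["sid"], "name": s["name"], "courses": r_courses(s["sid"], s["name"])}
        for s in src2_students for c in src2_courses
        if s["course_code"] == c["course_code"]
    ]
    return first + second

mary_courses = r_courses("001", "Mary")
print([c["course"] for c in mary_courses])   # ['CS120', 'CS120', 'CS200']
```

Because r_courses is keyed only on the pair (sid, name), every record produced for Mary carries the same three course records, regardless of which query term produced the student record.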

In step 33, a composition-with-rules algorithm is executed to replace in mapping M₂₃ all references to schema S₂ with references to schema S₁. Let a rule set R₁₂ be the set of all the rules that result after applying step 32. The composition-with-rules algorithm includes the following steps 33A, 33B, and 33C.

In step 33A, a mapping holder R is initialized to be M₂₃. For the example under discussion, R includes only one formula, m₃, shown earlier. Generally, R includes all formulas m comprised by mapping M₂₃. The formulas in the mapping holder R resulting from step 33A will be transformed by steps 33B and 33C to a form that has eliminated all references to S₂ to become the output mapping M₁₃ of Algorithm 1.

In step 33B, for each SO nested tgd m in R, each top-level set-type element that is mentioned in formula m is replaced by the corresponding rule (for that element) in the rule set R₁₂. For the example under discussion, the formula m₃ is transformed into the following formula:
m′₃: for (s in <body of R_(students)>) (c in s.courses) (e in <body of R_(evaluations)>)
  where c.eval_key=e.eval_key
  exists (s′ in new_tgt.students) (r′ in s′.results) (e′ in new_tgt.evaluations)
  where r′.eval_key=e′.eval_key
  with s.sid=s′.sid and e.eval_key=e′.eval_key and
  e.grade=e′.grade and e.evaluation_file=e′.evaluation_file
where the notation <body of R_(students)> represents a short-hand for the union of query terms that is on the right-hand side of the equal symbol in the above definition of R_(students). The notation <body of R_(evaluations)> is a similar short-hand for the case of R_(evaluations).

In step 33C, each formula in R that results after step 33B is rewritten by using at least one rewrite rule of four rewrite rules. These rewrite rules, called de-nesting, union separation, record projection and case reduction, are illustrated in FIGS. 10 and 11, in accordance with embodiments of the present invention. FIG. 10 illustrates de-nesting, and FIG. 11 illustrates union separation, record projection, and case reduction.

For a given formula m in R, the de-nesting rule removes inner nested expressions such as the inner nested expression {for (y₁ in Y₁) . . . (y_(k) in Y_(k)) where B return r} in FIG. 10 via replacement of each inner nested expression by the notation e[g→r] in the relevant clause(s) of formula m to simulate the functionality of each replaced inner nested expression and via insertion of the generators (y₁ in Y₁) . . . (y_(k) in Y_(k)) in the outer for clause. The de-nesting rule assumes that all the variables in the inner for_where_return_expression are different from (i.e., do not conflict with) the outer variables. This is accomplished by renaming all the inner variables before applying every de-nesting step.

The union separation rule separates the N expressions (N at least 2) joined by the union operator (∪) in a given formula m in R into N formulas. In FIG. 11, for example, the formula comprising the two expressions X₁ and X₂ joined as X₁∪X₂ is separated into the two formulas shown, namely a first formula comprising X₁ and a second formula comprising X₂.

The record projection rule projects expression e_(i) in a formula m by replacing each appearance of [ . . . , L_(i)=e_(i), . . . ].L_(i) with e_(i), as shown in FIG. 11.

The case reduction rule has two cases. First, whenever a formula m includes an appearance of <L_(i)=e_(i)>.L_(i), this appearance of <L_(i)=e_(i)>.L_(i) is replaced by e_(i). However, if the formula m includes an appearance of <L_(i)=e_(i)>.L_(j), where L_(i) and L_(j) are different, then the formula is abandoned. The reason for abandoning the formula is that the formula can never be satisfied, since the choice expression <L_(i)=e_(i)> includes a component (element) called L_(i) but the larger expression tries to obtain a component called L_(j). Since the formula cannot be satisfied, there is no need to include the formula in the final result of composition. Here, the notation ⊥ is used to denote formally an abandoned formula (or, a formula that cannot be satisfied).
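
The record projection and case reduction rules can be sketched over a small expression tree. The classes Record, Choice, and Proj and the function simplify are hypothetical names introduced for this sketch; atomic expressions are represented as plain strings, and the string "⊥" marks an abandoned formula.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:            # [L1=e1, ..., Ln=en]
    items: tuple         # tuple of (label, expression) pairs

@dataclass(frozen=True)
class Choice:            # <L=e>
    label: str
    expr: object

@dataclass(frozen=True)
class Proj:              # e.L
    base: object
    label: str

BOTTOM = "⊥"             # marks a formula that can never be satisfied

def simplify(e):
    """Apply record projection and case reduction until neither applies."""
    if not isinstance(e, Proj):
        return e
    base = simplify(e.base)
    if base == BOTTOM:
        return BOTTOM
    if isinstance(base, Record):
        for label, expr in base.items:
            if label == e.label:
                return simplify(expr)          # record projection
    if isinstance(base, Choice):
        # Case reduction: matching label yields the expression, else abandon.
        return simplify(base.expr) if base.label == e.label else BOTTOM
    return Proj(base, e.label)

rec = Record((("sid", "s1.sid"), ("name", "s1.name")))
print(simplify(Proj(rec, "sid")))                       # s1.sid
print(simplify(Proj(Choice("a", "x"), "a")))            # x
print(simplify(Proj(Choice("a", "x"), "b")))            # ⊥
```

In a fuller implementation, a formula containing any subexpression that simplifies to ⊥ would be dropped from the mapping holder R rather than rewritten further.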

The rewriting process is as follows. While there is some formula m′ in R to which some rewrite rule applies (either to m′ itself or to some subexpression of it), the method of the present invention applies the rewrite rule to m′, adds the resulting formulas (if not equal to ⊥) to R and removes m′ from R.

For example, for the above m′₃, since <body of R_(students)> is the union of two query terms (which can be denoted here, in short, T₁ and T₂), it follows that the union separation rule is applicable and results in the following two formulas:
for (s in T₁) (c in s.courses) (e in <body of R_(evaluations)>)
  where c.eval_key=e.eval_key
  exists (s′ in new_tgt.students) (r′ in s′.results) (e′ in new_tgt.evaluations)
  where r′.eval_key=e′.eval_key
  with s.sid=s′.sid and e.eval_key=e′.eval_key and
  e.grade=e′.grade and e.evaluation_file=e′.evaluation_file
for (s in T₂) (c in s.courses) (e in <body of R_(evaluations)>)
  where c.eval_key=e.eval_key
  exists (s′ in new_tgt.students) (r′ in s′.results) (e′ in new_tgt.evaluations)
  where r′.eval_key=e′.eval_key
  with s.sid=s′.sid and e.eval_key=e′.eval_key and
  e.grade=e′.grade and e.evaluation_file=e′.evaluation_file

The mapping m′₃ is removed and the above two formulas are added to R. The rewriting process continues by trying to rewrite the first of the two formulas. Since T₁ is the expression
T₁=for (s₁ in src₁.students)
  return [sid=s₁.sid, name=s₁.name, courses=R_(courses)(s₁.sid, s₁.name)]
(where the variable s has been renamed to s₁), it follows that the de-nesting rule can be applied and the result is:
for (s₁ in src₁.students)
  (c in [sid=s₁.sid, name=s₁.name, courses=R_(courses)(s₁.sid, s₁.name)].courses)
  (e in <body of R_(evaluations)>)
  where c.eval_key=e.eval_key
  exists (s′ in new_tgt.students) (r′ in s′.results) (e′ in new_tgt.evaluations)
  where r′.eval_key=e′.eval_key
  with [sid=s₁.sid, name=s₁.name, courses=R_(courses)(s₁.sid, s₁.name)].sid=s′.sid
  and e.eval_key=e′.eval_key and
  e.grade=e′.grade and e.evaluation_file=e′.evaluation_file

Note that the original variable s has been replaced, in each of the two places where it occurred, by the record expression [sid=s₁.sid, name=s₁.name, courses=R_(courses)(s₁.sid, s₁.name)] that is in the return clause of T₁. Now the record projection rule can be applied twice and the above formula is replaced by the following:
for (s₁ in src₁.students)
  (c in R_(courses)(s₁.sid, s₁.name))
  (e in <body of R_(evaluations)>)
  where c.eval_key=e.eval_key
  exists (s′ in new_tgt.students) (r′ in s′.results) (e′ in new_tgt.evaluations)
  where r′.eval_key=e′.eval_key
  with s₁.sid=s′.sid and e.eval_key=e′.eval_key and
  e.grade=e′.grade and e.evaluation_file=e′.evaluation_file
This formula is the same as:
for (s₁ in src₁.students)
  (c in <body of R_(courses) where l₁ is replaced by s₁.sid and l₂ is replaced by s₁.name>)
  (e in <body of R_(evaluations)>)
  where c.eval_key=e.eval_key
  exists (s′ in new_tgt.students) (r′ in s′.results) (e′ in new_tgt.evaluations)
  where r′.eval_key=e′.eval_key
  with s₁.sid=s′.sid and e.eval_key=e′.eval_key and
  e.grade=e′.grade and e.evaluation_file=e′.evaluation_file

This process then continues by applying the union separation rule for the body of R_(courses), then de-nesting, and so on.

The algorithm terminates with a set M₁₃ of SO nested tgds that mention only schema S₁ and schema S₃. This set M₁₃ is the schema mapping that is the composition of M₁₂ and M₂₃, with respect to the relationship semantics. For the example, M₁₃ includes the below formulas (and more):
(m₁₃) for (s₁ in src₁.students) (s₂ in src₁.students) (s₃ in src₁.students)
  where E₁(s₂.sid, s₂.name, s₂.course, s₂.grade)=E₁(s₃.sid, s₃.name, s₃.course, s₃.grade)
  and s₁.sid=s₂.sid and s₁.name=s₂.name
  exists (s′ in new_tgt.students) (r′ in s′.results) (e′ in new_tgt.evaluations)
  where r′.eval_key=e′.eval_key
  with s₁.sid=s′.sid and E₁(s₃.sid, s₃.name, s₃.course, s₃.grade)=e′.eval_key
  and s₃.grade=e′.grade and
  F(s₃.sid, s₃.name, s₃.course, s₃.grade)=e′.evaluation_file
(m′₁₃) for (s₁ in src₁.students) (s₂ in src₁.students) (s₃ in src₂.students) (c in src₂.courses)
  where s₃.course_code=c.course_code
  and E₁(s₂.sid, s₂.name, s₂.course, s₂.grade)=E₂(s₃.sid, s₃.name, s₃.course, s₃.grade)
  and s₁.sid=s₂.sid and s₁.name=s₂.name
  exists (s′ in new_tgt.students) (r′ in s′.results) (e′ in new_tgt.evaluations)
  where r′.eval_key=e′.eval_key
  with s₁.sid=s′.sid and E₁(s₃.sid, s₃.name, s₃.course, s₃.grade)=e′.eval_key
  and G(s₃.sid, s₃.name, c.course, c.evaluation_file)=e′.grade and
  c.evaluation_file=e′.evaluation_file

In consideration of the preceding discussion, step 33C is implemented for each formula in R that results after step 33B, by performing the process depicted in the flow chart of FIG. 9C for each formula. The flow chart of FIG. 9C comprises steps 51-57.

Step 51 determines whether union separation is applicable. If step 51 determines that union separation is not applicable, then step 53 is next executed. If step 51 determines that union separation is applicable, then the union separation is executed in step 52, followed by iterative re-execution of the loop of steps 51-52 until union separation no longer applies in step 51 and the process next executes step 53.

Step 53 determines whether a de-nesting of a complex type expression is applicable. If step 53 determines that said de-nesting is not applicable, then step 55 is next executed. If step 53 determines that said de-nesting is applicable, then said de-nesting is executed in step 54, followed by iterative re-execution of the loop of steps 53-54 until de-nesting no longer applies in step 53 and the process next executes step 55. Each de-nesting de-nests an outermost complex type expression in the formula being processed.

Step 55 performs all case reductions and record projections that are applicable.

Step 56 determines whether the formula being processed comprises an expression of the form R(e₁, . . . , e_(n)), wherein R(e₁, . . . , e_(n)) is a rule of the rule set R₁₂ that results from step 32 of FIG. 9A. If step 56 determines that the formula being processed does not comprise such an expression of the form R(e₁, . . . , e_(n)), then the process ends. If step 56 determines that the formula being processed comprises such an expression of the form R(e₁, . . . , e_(n)), then step 57 replaces R(e₁, . . . , e_(n)) by the body of R such that the parameters of R (namely, l₁, . . . , l_(n)) are respectively replaced by e₁, . . . , e_(n). Step 57 was illustrated supra in an example in which R_(courses) was replaced by the body of R_(courses), and the parameters l₁ and l₂ in the body of R_(courses) were replaced by s₁.sid and s₁.name, respectively.

After step 57 is executed, the process re-executes the loop 51-57 iteratively until there no longer remains an expression of the form R(e₁, . . . , e_(n)) and the process ends.

The set of all such formulas, when taken in their entirety, describes the relationship between instances over S₁ and instances over S₃ that is implied by the relationships induced by the two given schema mappings (from S₁ to S₂ and from S₂ to S₃, respectively), characterized by not mentioning schema S₂ (i.e., the elements tgt.students, tgt.students.courses, and tgt.evaluations of schema S₂ are not recited in formulas m₁₃ and m′₁₃ of mapping M₁₃).

As illustrated in the preceding examples, the particular rewrite rules invoked, and the order and number of times that said particular rewrite rules are invoked, depend on the structure of each formula m in R, such that application of the particular rewrite rules to m eliminates all references to schema S₂.

4. Algorithm 2: Composition with Respect to Transformation Semantics

The input to Algorithm 2 is as follows. Schema mappings M₁₂ and M₂₃ are inputs expressed as constraints in the language of second-order nested tuple-generating dependencies (or SO nested tgds). The schema mapping M₁₂ is a set of SO nested tgds that relate a schema S₁ and a schema S₂. Similarly, schema mapping M₂₃ is a set of SO nested tgds that relate a schema S₂ to a schema S₃.

The output of Algorithm 2 is as follows. Schema mapping M₁₃, consisting of a set of SO nested tgds that relate schema S₁ and schema S₃, is an output representing the composition of M₁₂ and M₂₃ under the transformation semantics.

FIG. 12 is a flow chart depicting steps 41-45 of Algorithm 2 for determining a composition of schema mappings with respect to transformation semantics, in accordance with embodiments of the present invention. In the discussion of FIG. 12, the term “target element” refers to an element of schema S₂.

In FIG. 12, steps 41-43 for Algorithm 2 are the same as steps 31-33 (including steps 33A, 33B, and 33C), respectively, of FIG. 9 for Algorithm 1.

In step 44, each of the SO nested tgds in the mapping holder R that result after step 43, with no remaining reference to schema S₂, is reduced by applying the following procedure. Each equality that appears in the first where clause of the SO nested tgd and involves Skolem terms is processed as follows. If the equality is of the form F(t₁, . . . , t_(n))=F(t′₁, . . . , t′_(n)) (i.e., equating Skolem terms with the same Skolem function), then the equality of said form is replaced by t₁=t′₁ and . . . and t_(n)=t′_(n) to construct a new SO nested tgd that replaces the old SO nested tgd. If this new SO nested tgd still contains some equality involving Skolem terms in the first where clause, then step 44 is applied again (recursively) to the new SO nested tgd. If the equality is of the form F(t₁, . . . , t_(n))=G(t′₁, . . . , t′_(n)) (i.e., equating Skolem terms with different Skolem functions) or of the form e=F(t₁, . . . , t_(n)), where e is not a Skolem term, then the current SO nested tgd is eliminated from any further processing. In other words, step 44 is applied recursively to remove all Skolem term equalities from the first where clause, and to discard every SO nested tgd in which different Skolem functions are equated or in which a Skolem term is equated to a non-Skolem term.

For example, to apply step 44 to the SO nested tgd m₁₃ shown earlier, the equality E₁(s₂.sid, s₂.name, s₂.course, s₂.grade)=E₁(s₃.sid, s₃.name, s₃.course, s₃.grade) is replaced by s₂.sid=s₃.sid and s₂.name=s₃.name and s₂.course=s₃.course and s₂.grade=s₃.grade. The resulting formula is:
for (s₁ in src₁.students) (s₂ in src₁.students) (s₃ in src₁.students)
  where s₂.sid=s₃.sid and s₂.name=s₃.name and
  s₂.course=s₃.course and s₂.grade=s₃.grade
  and s₁.sid=s₂.sid and s₁.name=s₂.name
  exists (s′ in new_tgt.students) (r′ in s′.results) (e′ in new_tgt.evaluations)
  where r′.eval_key=e′.eval_key
  with s₁.sid=s′.sid and E₁(s₃.sid, s₃.name, s₃.course, s₃.grade)=e′.eval_key
  and s₃.grade=e′.grade and
  F(s₃.sid, s₃.name, s₃.course, s₃.grade)=e′.evaluation_file

Since none of the equalities in the first where clause contains a Skolem term, step 44 finishes here for the above SO nested tgd (that is, there is no need for a recursive application of step 44).

As another example, the formula m′₁₃ shown earlier is eliminated in step 44, because m′₁₃ contains in its first where clause the equality E₁(s₂.sid, s₂.name, s₂.course, s₂.grade)=E₂(s₃.sid, s₃.name, s₃.course, s₃.grade), between two Skolem terms with different functions, E₁ and E₂.
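
The reduction of step 44 on the first where clause can be sketched in Python as follows. The encoding is an assumption made for illustration: a Skolem term is a tuple whose first entry is the function symbol, any other term is a string, and the hypothetical function reduce_where returns None when the SO nested tgd is to be eliminated.

```python
def is_skolem(term):
    """Skolem terms are encoded as tuples: (function symbol, arg1, ..., argn)."""
    return isinstance(term, tuple)

def reduce_where(equalities):
    """Return the reduced equalities, or None if the tgd must be eliminated."""
    pending, reduced = list(equalities), []
    while pending:
        left, right = pending.pop(0)
        if is_skolem(left) and is_skolem(right):
            if left[0] != right[0]:
                return None                    # different Skolem functions
            # Same function: equate the arguments pairwise and keep processing,
            # which also handles Skolem terms nested inside the arguments.
            pending.extend(zip(left[1:], right[1:]))
        elif is_skolem(left) or is_skolem(right):
            return None                        # Skolem term = non-Skolem term
        else:
            reduced.append((left, right))
    return reduced

m13_where = [
    (("E1", "s2.sid", "s2.name", "s2.course", "s2.grade"),
     ("E1", "s3.sid", "s3.name", "s3.course", "s3.grade")),
    ("s1.sid", "s2.sid"),
    ("s1.name", "s2.name"),
]
m13_prime_where = [
    (("E1", "s2.sid", "s2.name", "s2.course", "s2.grade"),
     ("E2", "s3.sid", "s3.name", "s3.course", "s3.grade")),
]
print(reduce_where(m13_where))
print(reduce_where(m13_prime_where))   # None
```

For the where clause of m₁₃ the sketch returns six equalities between source atoms and no Skolem terms, while for m′₁₃ it returns None because E₁ and E₂ differ.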

Step 45 minimizes the resulting SO nested tgds in the mapping holder R. For each SO nested tgd that results after step 44, step 45 finds an equivalent SO nested tgd that has a minimal number of logically required variables in the for clause of each SO nested tgd. For example, the above formula can be shown to be equivalent to the following formula that uses just one variable in the for clause:
for (s₃ in src₁.students)
  exists (s′ in new_tgt.students) (r′ in s′.results) (e′ in new_tgt.evaluations)
  where r′.eval_key=e′.eval_key
  with s₃.sid=s′.sid and E₁(s₃.sid, s₃.name, s₃.course, s₃.grade)=e′.eval_key
  and s₃.grade=e′.grade and
  F(s₃.sid, s₃.name, s₃.course, s₃.grade)=e′.evaluation_file

Intuitively, the reason why the earlier formula is equivalent to the above formula is that all that is needed from variables s₁ and s₂ in the with clause is s₁.sid. But this can be replaced with s₃.sid, since these two expressions are equal. Furthermore, whenever the variable s₃ can be instantiated to a tuple, the variables s₁ and s₂ can also be instantiated to the same tuple such that their pattern of equalities is satisfied. Hence, the formula is the same whether or not we assert s₁ and s₂ in the for clause (s₃ is enough).

The formal procedure for minimization of SO nested tgds is similar to the minimization of conjunctive queries in database query optimization.

The output of the algorithm is the set of all minimized SO nested tgds that result after step 45. As an example, the above SO nested tgd is part of the final result of composing schema mappings M₁₂ and M₂₃, under the transformation semantics. Another formula that is also part of this final result is:
for (s in src₂.students) (c in src₂.courses)
  where s.course_code=c.course_code
  exists (s′ in new_tgt.students) (r′ in s′.results) (e′ in new_tgt.evaluations)
  where r′.eval_key=e′.eval_key
  with s.sid=s′.sid and E₂(s.sid, s.name, c.course, c.evaluation_file)=e′.eval_key
  and G(s.sid, s.name, c.course, c.evaluation_file)=e′.grade and
  c.evaluation_file=e′.evaluation_file

The set of all such formulas in M₁₃ captures all the different ways of transforming the data from schema S₁ to schema S₃ that are equivalent to all the different ways of first transforming data from schema S₁ to the intermediate schema S₂ (as dictated by M₁₂) followed by all the different ways of transforming the resulting data from the intermediate schema S₂ to schema S₃ (as dictated by M₂₃).

It is noted that, although Algorithm 2 is Algorithm 1 with added steps 44 and 45, the result of Algorithm 2 is a set of formulas that is simpler (has fewer and also simpler formulas) than the result of Algorithm 1. If the main intention of schema mappings is data transformation (and not describing relationships between schemas), Algorithm 2 is to be preferred to Algorithm 1. However, if the primary intention of schema mappings is representing relationships between schemas, then Algorithm 1 is to be preferred, since Algorithm 2 is not guaranteed to return a schema mapping with equivalent relationship semantics.

It is noted that Algorithms 1 and 2 are each applicable to a variety of schemas, including relational database schemas, XML schemas, hierarchical schemas, etc.

5. Computer System

FIG. 13 illustrates a computer system 90 used for generating a schema mapping, in accordance with embodiments of the present invention. The computer system 90 comprises a processor 91, an input device 92 coupled to the processor 91, an output device 93 coupled to the processor 91, and memory devices 94 and 95 each coupled to the processor 91. The input device 92 may be, inter alia, a keyboard, a mouse, etc. The output device 93 may be, inter alia, a printer, a plotter, a computer screen, a magnetic tape, a removable hard disk, a floppy disk, etc. The memory devices 94 and 95 may be, inter alia, a hard disk, a floppy disk, a magnetic tape, an optical storage such as a compact disc (CD) or a digital video disc (DVD), a dynamic random access memory (DRAM), a read-only memory (ROM), etc. The memory device 95 includes a computer code 97. The computer code 97 includes an algorithm for generating a schema mapping. The processor 91 executes the computer code 97. The memory device 94 includes input data 96. The input data 96 includes input required by the computer code 97. The output device 93 displays output from the computer code 97. Either or both memory devices 94 and 95 (or one or more additional memory devices not shown in FIG. 13) may be used as a computer usable medium (or a computer readable medium or a program storage device) having a computer readable program code embodied therein and/or having other data stored therein, wherein the computer readable program code comprises the computer code 97. Generally, a computer program product (or, alternatively, an article of manufacture) of the computer system 90 may comprise said computer usable medium (or said program storage device).

Thus the present invention discloses a process for deploying or integrating computing infrastructure, comprising integrating computer-readable code into the computer system 90, wherein the code in combination with the computer system 90 is capable of performing a method for generating a schema mapping.

While FIG. 13 shows the computer system 90 as a particular configuration of hardware and software, any configuration of hardware and software, as would be known to a person of ordinary skill in the art, may be utilized for the purposes stated supra in conjunction with the particular computer system 90 of FIG. 13. For example, the memory devices 94 and 95 may be portions of a single memory device rather than separate memory devices.

While embodiments of the present invention have been described herein for purposes of illustration, many modifications and changes will become apparent to those skilled in the art. Accordingly, the appended claims are intended to encompass all such modifications and changes as fall within the true spirit and scope of this invention.

1. A method for generating a schema mapping, said method comprising:providing a mapping M₁₂ from a schema S₁ to a schema S₂, said mappingM₁₂ relating the schema S₂ to the schema S₁, said schema S₁ and schemaS₂ each comprising one or more elements, said mapping M₁₂ beingexpressed in terms of at least one second-order nested tuple-generatingdependency (SO nested tgd); providing a mapping M₂₃ from the schema S₂to a schema S₃, said mapping M₂₃ relating the schema S₃ to the schemaS₂, said schema S₃ comprising one or more elements, said mapping M₂₃being expressed in terms of at least one SO nested tgd; and generating amapping M₁₃ from the schema S₁ to the schema S₃, said mapping M₁₃relating the schema S₃ to the schema S₁, said mapping M₁₃ beingexpressed in terms of at least one SO nested tgd that does not expresslyrecite any element of the schema S₂, said generating the mapping M₁₃comprising generating the mapping M₁₃ as a composition of the mappingsM₁₂ and M₂₃, wherein at least one schema of the schemas S₁ and S₂comprises at least one complex type expression nested inside anothercomplex type expression.
 2. The method of claim 1, where the mapping M₁₃ defines the composition of the mappings M₁₂ and M₂₃ with respect to a relationship semantics.
 3. The method of claim 1, where the mapping M₁₃ defines the composition of the mappings M₁₂ and M₂₃ with respect to a transformation semantics.
 4. The method of claim 3, wherein providing the mapping M₁₂ is performed before providing the mapping M₂₃ is performed, in accordance with a target schema evolution.
 5. The method of claim 3, wherein providing the mapping M₁₂ is performed after providing the mapping M₂₃ is performed, in accordance with a source schema evolution.
 6. The method of claim 1, where said generating the mapping M₁₃ as a composition of the mappings M₁₂ and M₂₃ comprises: Skolemizing each SO nested tgd of the mapping M₁₂; computing a rule R_(E) for each element E in the schema S₂ that is of set type, wherein the rule R_(E) describes how data instances of the element E can be created based on the Skolemized SO target tgds that are relevant for the element E, and wherein R₁₂ is a rule set denoting the set of all of said rules R_(E); and replacing all references to the schema S₂ in the mapping M₂₃ with references to the schema S₁ using the rule set R₁₂, wherein said replacing all references to the schema S₂ in the mapping M₂₃ converts the mapping M₂₃ to the mapping M₁₃.
 7. The method of claim 6, wherein said replacing references to the schema S₂ in the mapping M₂₃ is implemented by executing a composition-with-rules algorithm comprising: initializing a mapping holder R to the mapping M₂₃; after said initializing, for each SO nested tgd m in R, replacing each top-level set-type element in the SO nested tgd m by a corresponding rule in the rule set R₁₂; and after said replacing each top-level set-type element in the SO nested tgd m in R, rewriting each SO nested tgd m in R using at least one rewrite rule selected from the group consisting of a union separation rewrite rule, a record projection rewrite rule, a case reduction rewrite rule, and combinations thereof.
 8. The method of claim 7, wherein said rewriting comprises for each SO nested tgd m in R: executing at least one union separation; and de-nesting at least one complex type expression after said executing at least one union separation is performed.
 9. The method of claim 6, wherein the method further comprises: after said replacing all references to the schema S₂ in the mapping M₂₃: eliminating in R all Skolem term equalities in which different Skolem functions are equated or in which a Skolem term is equated to a non-Skolem term; and after said eliminating: reducing the number of variables to a minimal number of logically required variables in a FOR clause of each SO nested tgd in R.
 10. The method of claim 1, wherein the schemas S₁, S₂, and S₃ are Extensible Markup Language (XML) schema.
 11. A computer program product, comprising a computer usable medium having a computer readable program code embodied therein, said computer readable program code comprising an algorithm adapted to implement the method of claim 1.
 12. A method for generating a schema mapping, said method comprising: providing a mapping M₁₂ from a schema S₁ to a schema S₂, said mapping M₁₂ relating the schema S₂ to the schema S₁, said schema S₁ and schema S₂ each comprising one or more elements, said mapping M₁₂ being expressed in terms of at least one second-order nested tuple-generating dependency (SO nested tgd); providing a mapping M₂₃ from the schema S₂ to a schema S₃, said mapping M₂₃ relating the schema S₃ to the schema S₂, said schema S₃ comprising one or more elements, said mapping M₂₃ being expressed in terms of at least one SO nested tgd; and generating a mapping M₁₃ from the schema S₁ to the schema S₃, said mapping M₁₃ relating the schema S₃ to the schema S₁, said mapping M₁₃ being expressed in terms of at least one SO nested tgd that does not expressly recite any element of the schema S₂, said generating the mapping M₁₃ comprising generating the mapping M₁₃ as a composition of the mappings M₁₂ and M₂₃, wherein the mapping M₁₃ defines the composition of the mappings M₁₂ and M₂₃ with respect to a transformation semantics.
 13. The method of claim 12, wherein providing the mapping M₁₂ is performed before providing the mapping M₂₃ is performed, in accordance with a target schema evolution.
 14. The method of claim 12, wherein providing the mapping M₁₂ is performed after providing the mapping M₂₃ is performed, in accordance with a source schema evolution.
 15. The method of claim 12, where said generating the mapping M₁₃ as a composition of the mappings M₁₂ and M₂₃ comprises: Skolemizing each SO nested tgd of the mapping M₁₂; computing a rule R_(E) for each element E in the schema S₂ that is of set type, wherein the rule R_(E) describes how data instances of the element E can be created based on the Skolemized SO target tgds that are relevant for the element E, and wherein R₁₂ is a rule set denoting the set of all of said rules R_(E); replacing all references to the schema S₂ in the mapping M₂₃ with references to the schema S₁ using the rule set R₁₂, wherein said replacing all references to the schema S₂ in the mapping M₂₃ converts the mapping M₂₃ to the mapping M₁₃; after said replacing all references to the schema S₂ in the mapping M₂₃: eliminating in R all Skolem term equalities in which different Skolem functions are equated or in which a Skolem term is equated to a non-Skolem term; and after said eliminating: reducing the number of variables to a minimal number of logically required variables in a FOR clause of each SO nested tgd in R.
 16. The method of claim 15, wherein said replacing references to the schema S₂ in the mapping M₂₃ is implemented by executing a composition-with-rules algorithm comprising: initializing a mapping holder R to the mapping M₂₃; after said initializing, for each SO nested tgd m in R, replacing each top-level set-type element in the SO nested tgd m by a corresponding rule in the rule set R₁₂; and after said replacing each top-level set-type element in the SO nested tgd m in R, rewriting each SO nested tgd m in R using at least one rewrite rule selected from the group consisting of a union separation rewrite rule, a record projection rewrite rule, a case reduction rewrite rule, and combinations thereof.
 17. The method of claim 16, wherein said rewriting comprises for each SO nested tgd m in R: executing at least one union separation; and de-nesting at least one complex type expression after said executing at least one union separation is performed.
 18. The method of claim 12, wherein the schemas S₁, S₂, and S₃ are Extensible Markup Language (XML) schema.
 19. The method of claim 12, wherein the schemas S₁, S₂, and S₃ are relational database schema.
 20. A computer program product, comprising a computer usable medium having a computer readable program code embodied therein, said computer readable program code comprising an algorithm adapted to implement the method of claim 12.
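By way of illustration only, and not as part of or a limitation on the claims, the composition steps recited in claims 6 and 15 (Skolemizing M₁₂, computing a rule per set-type element of S₂, and replacing references to S₂ in M₂₃) might be sketched as follows for the flat relational case of claim 19. Everything named here is an assumption made for the sketch: the `Tgd` and `Skolem` classes, the helper functions, the Skolem function naming scheme, and the example schemas are hypothetical. The sketch assumes each tgd ranges over a single source set and that M₂₃ only copies S₂ columns; it does not handle nested types, case reduction, Skolem term equality elimination, or variable minimization.

```python
from dataclasses import dataclass


@dataclass
class Skolem:
    """A Skolem term func(args): a value invented by the mapping."""
    func: str
    args: tuple  # names of the source columns the term depends on


@dataclass
class Tgd:
    """Simplified flat tgd: for x in source, exists y in target with
    y.col = x.<columns[col]> (a source column name or a Skolem term)."""
    source: str
    target: str
    columns: dict


def skolemize(tgd, index, s1, s2):
    """Fill every target column the tgd leaves unspecified with a Skolem term."""
    cols = dict(tgd.columns)
    for col in s2[tgd.target]:
        if col not in cols:
            # Hypothetical naming scheme: one function per (tgd, column).
            cols[col] = Skolem(f"f{index}_{col}", tuple(s1[tgd.source]))
    return Tgd(tgd.source, tgd.target, cols)


def compute_rules(m12, s1, s2):
    """Rule set R12: for each S2 set E, the Skolemized tgds that populate E."""
    rules = {name: [] for name in s2}
    for i, tgd in enumerate(m12):
        rules[tgd.target].append(skolemize(tgd, i, s1, s2))
    return rules


def compose(m12, m23, s1, s2):
    """Rewrite each tgd of M23 so that it refers to S1 instead of S2."""
    rules = compute_rules(m12, s1, s2)
    m13 = []
    for tgd in m23:
        # Union separation: one output tgd per branch of the rule for the S2 set.
        for branch in rules[tgd.source]:
            # Record projection: replace each S2 column by the S1 expression
            # (column name or Skolem term) that the branch assigns to it.
            cols = {t: branch.columns[c] for t, c in tgd.columns.items()}
            m13.append(Tgd(branch.source, tgd.target, cols))
    return m13


# Hypothetical example schemas and mappings.
s1 = {"Person": ["pname", "city"]}
s2 = {"Emp": ["name", "mgr"]}
m12 = [Tgd("Person", "Emp", {"name": "pname"})]  # mgr is left existential
m23 = [Tgd("Emp", "Staff", {"who": "name", "boss": "mgr"})]
m13 = compose(m12, m23, s1, s2)
```

In this sketch the resulting `m13` ranges over `Person` only, so no element of the intermediate schema appears in it; the `boss` column is carried by a Skolem term because M₁₂ did not specify a manager.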